Instructions

Below you will find several empty R code scripts and answer prompts. Your task is to fill in the required code snippets and answer the corresponding questions.

Cereal Data

Today, we start by looking at a collection of breakfast cereals:

With variables:

Produce a histogram of the sugar variable.
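One way to draw such a histogram is sketched below, assuming the ggplot2 package is available. The data frame `cereal_demo` and its values are hypothetical stand-ins, not the course data, and the bin width of 2 is an arbitrary choice.

```r
library(ggplot2)

# Hypothetical stand-in for the cereal data; the sugar values are made up.
cereal_demo <- data.frame(sugar = c(6, 8, 5, 0, 8, 10, 14, 8, 6, 5, 12, 1))

# Each bar counts the cereals whose sugar content falls in a 2-unit-wide bin.
p <- ggplot(cereal_demo, aes(x = sugar)) +
  geom_histogram(binwidth = 2)
print(p)
```

A narrower `binwidth` shows more detail but makes the plot noisier; it is worth trying a few values.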

Now, compute the standard deviation of the variable sugar:

##  [1] -1.02631579  0.97368421 -2.02631579 -7.02631579  0.97368421
##  [6]  2.97368421  6.97368421  0.97368421 -1.02631579 -2.02631579
## [11]  4.97368421 -6.02631579  1.97368421 -0.02631579  5.97368421
## [16] -4.02631579 -5.02631579  4.97368421  5.97368421 -0.02631579
## [21] -7.02631579 -4.02631579  2.97368421 -2.02631579  5.97368421
## [26]  3.97368421 -0.02631579  2.97368421  4.97368421  4.97368421
## [31]  7.97368421  1.97368421 -2.02631579 -4.02631579 -3.02631579
## [36]  3.97368421  2.97368421  3.97368421 -1.02631579  1.97368421
## [41] -4.02631579 -1.02631579  4.97368421 -4.02631579  3.97368421
## [46]  3.97368421  5.97368421 -1.02631579  1.97368421 -0.02631579
## [51] -5.02631579  2.97368421  6.97368421 -4.02631579 -7.02631579
## [56] -7.02631579 -1.02631579  4.97368421  0.97368421 -1.02631579
## [61] -5.02631579 -4.02631579 -7.02631579 -7.02631579 -7.02631579
## [66]  7.97368421 -4.02631579 -2.02631579 -4.02631579  6.97368421
## [71] -4.02631579 -4.02631579  4.97368421 -4.02631579 -4.02631579
## [76]  0.97368421
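The output above appears to list each cereal's deviation from the mean sugar content (one value per cereal) rather than a single standard deviation. The sketch below, on made-up values, shows how the two are related: the standard deviation summarizes the deviations in one number.

```r
# Hypothetical sugar values (made up; not the course data).
sugar <- c(6, 8, 5, 0, 8, 10, 14, 8, 6, 5)

# Deviations from the mean: one value per cereal, positive or negative.
deviations <- sugar - mean(sugar)

# Sample standard deviation built by hand from the deviations.
sd_manual <- sqrt(sum(deviations^2) / (length(sugar) - 1))

# The built-in sd() returns the same single number.
sd_builtin <- sd(sugar)

print(deviations)
print(sd_builtin)
```

Because the deviations are squared, averaged, and then square-rooted, the standard deviation ends up in the same units as the original variable.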

What are the units of this measurement?

Answer: grams

Now, compute the deciles of the variable score:

##   0%  10%  20%  30%  40%  50%  60%  70%  80%  90% 100% 
## 18.0 28.0 31.0 34.5 37.0 40.0 42.0 48.0 53.0 58.0 84.0
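Deciles like these can be computed with base R's `quantile()` by asking for probabilities in steps of 0.1. The sketch below uses a short, made-up `score` vector rather than the course data.

```r
# Hypothetical health scores (made up; not the course data).
score <- c(18, 28, 31, 34, 35, 37, 40, 42, 48, 53, 58, 84)

# probs runs from 0 to 1 in steps of 0.1, giving the 0%, 10%, ..., 100% cut points.
deciles <- quantile(score, probs = seq(0, 1, by = 0.1))
print(deciles)

# Share of observations at or below the 30% cut point.
share_below <- mean(score <= deciles["30%"])
print(share_below)
```

The share at or below the 30% cut point is only approximately 0.3 in a small sample, because `quantile()` interpolates between observed values by default.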

What is the value of the 30th percentile? Describe what this means in words:

Answer: 34.5. About 30% of the cereals have a score of 34.5 or lower.

Produce a boxplot of score and brand.
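A grouped boxplot puts the categorical variable on one axis and the numeric variable on the other. The sketch below assumes ggplot2 is available; `cereal_demo`, its brand labels, and its scores are hypothetical.

```r
library(ggplot2)

# Hypothetical stand-in; brand labels and scores are made up.
cereal_demo <- data.frame(
  brand = rep(c("A", "B", "C"), each = 4),
  score = c(30, 35, 40, 45, 28, 33, 50, 60, 38, 42, 47, 55)
)

# One box per brand; the line inside each box marks the median score.
p <- ggplot(cereal_demo, aes(x = brand, y = score)) +
  geom_boxplot()
print(p)
```

The same pattern applies to the shelf boxplots: swap the categorical and numeric columns in `aes()` as needed.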

Which brand seems to have the healthiest cereals?

Answer: Kellogg's

Produce a boxplot of score and shelf.

Produce a boxplot of sugar and shelf.

If I want a healthy but reasonably sweet cereal which shelf would be the best to look on?

Answer: Top Shelf

Tea Reviews

Next, we will take another look at a dataset of tea reviews that I used in a previous lecture:

With variables:

- name: the full name of the tea
- type: the type of tea. One of:
    - black
    - chai
    - decaf
    - flavors
    - green
    - herbal
    - masters
    - matcha
    - oolong
    - pu_erh
    - rooibos
    - white
- score: user rated score; from 0 to 100
- price: estimated price of one cup of tea
- num_reviews: total number of online reviews

Draw a scatterplot with num_reviews (x-axis) against score (y-axis) and add a regression line (recall: geom_smooth(method="lm")).
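A minimal sketch of such a plot follows, assuming ggplot2 is available. The `tea_demo` data frame is made up for illustration, and the separate `lm()` fit is an addition that lets the direction of the trend be read off as a number.

```r
library(ggplot2)

# Hypothetical stand-in for the tea data; the values are made up.
tea_demo <- data.frame(
  num_reviews = c(17, 49, 107, 340, 488, 620, 963, 1353, 1669),
  score = c(90, 91, 92, 92, 93, 94, 94, 95, 96)
)

# Scatterplot with a straight regression line fitted by least squares.
p <- ggplot(tea_demo, aes(x = num_reviews, y = score)) +
  geom_point() +
  geom_smooth(method = "lm")
print(p)

# The sign of the fitted slope says whether score rises or falls with reviews.
fit <- lm(score ~ num_reviews, data = tea_demo)
slope <- coef(fit)[["num_reviews"]]
print(slope)
```

A positive slope corresponds to an upward-sloping line in the plot, a negative slope to a downward-sloping one.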

Does the score tend to increase, decrease, or remain the same as the number of reviews increases?

Answer: Increase

Calculate the ventiles of the variable price.

##     0%     1%     2%     3%     4%     5%     6%     7%     8%     9% 
##   8.00  10.00  10.00  10.00  10.00  10.00  10.00  10.00  10.00  10.00 
##    10%    11%    12%    13%    14%    15%    16%    17%    18%    19% 
##  10.00  10.00  10.00  10.00  10.00  10.00  10.00  10.00  10.00  10.00 
##    20%    21%    22%    23%    24%    25%    26%    27%    28%    29% 
##  10.00  10.00  10.00  10.00  10.00  10.00  10.00  10.00  12.00  12.00 
##    30%    31%    32%    33%    34%    35%    36%    37%    38%    39% 
##  12.00  12.00  12.00  12.00  12.00  12.00  12.00  12.00  12.00  12.00 
##    40%    41%    42%    43%    44%    45%    46%    47%    48%    49% 
##  12.00  12.00  12.00  12.00  12.00  12.00  12.00  12.00  12.00  12.00 
##    50%    51%    52%    53%    54%    55%    56%    57%    58%    59% 
##  13.00  14.00  15.00  15.00  15.00  15.00  15.00  15.00  15.00  15.00 
##    60%    61%    62%    63%    64%    65%    66%    67%    68%    69% 
##  15.00  15.00  15.00  15.00  15.00  17.00  17.00  17.00  17.32  19.00 
##    70%    71%    72%    73%    74%    75%    76%    77%    78%    79% 
##  19.00  19.00  19.00  19.00  19.38  20.00  22.00  24.49  25.00  27.46 
##    80%    81%    82%    83%    84%    85%    86%    87%    88%    89% 
##  30.00  32.00  32.00  32.00  34.00  35.35  38.64  40.00  44.56  46.93 
##    90%    91%    92%    93%    94%    95%    96%    97%    98%    99% 
##  49.30  60.00  60.00  60.82  71.02  86.75  95.04 137.38 157.00 157.00 
##   100% 
## 196.00
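Ventiles split the data into 20 equal-sized groups, so their cut points fall every 5%, whereas the output above lists a cut point every 1% (percentiles). A sketch of the 5% version on made-up prices:

```r
# Hypothetical prices (made up; not the course data).
price <- c(8, 10, 10, 10, 12, 12, 12, 13, 15, 15, 17,
           19, 19, 20, 25, 30, 32, 40, 49, 60, 196)

# probs in steps of 0.05 gives the 0%, 5%, 10%, ..., 100% cut points.
ventiles <- quantile(price, probs = seq(0, 1, by = 0.05))
print(ventiles)

# Look up a single cut point by name.
print(ventiles["80%"])
```

Every fourth ventile coincides with a quintile, so the 80% entry is the same value in either output.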

What is the 80th percentile? Describe it in words, include the units of the problem in your answer.

Answer: The 80th percentile is 30 dollars: about 80% of the teas have an estimated price per cup of 30 dollars or less.

Plot the number of reviews (x-axis) against the score variable. Color the points according to price binned into 5 buckets.
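One possible way to bin a numeric variable into five equal-sized buckets is to cut it at its quintiles with base R's `cut()`. This is a sketch on made-up data, assuming ggplot2 for the plot; the course may provide its own binning helper, which is not shown here.

```r
library(ggplot2)

# Hypothetical stand-in for the tea data; the values are made up.
tea_demo <- data.frame(
  num_reviews = c(963, 678, 577, 1669, 488, 1113, 1353, 340, 814, 107),
  score = c(95, 96, 94, 95, 94, 93, 94, 92, 91, 94),
  price = c(64, 49, 32, 19, 21, 25, 18, 17, 15, 46)
)

# Quintile cut points: 0%, 20%, 40%, 60%, 80%, 100%.
breaks <- quantile(tea_demo$price, probs = seq(0, 1, by = 0.2))

# include.lowest = TRUE keeps the cheapest tea inside the first bucket.
tea_demo$price_bin <- cut(tea_demo$price, breaks = breaks, include.lowest = TRUE)

# Color each point by its price bucket.
p <- ggplot(tea_demo, aes(x = num_reviews, y = score, color = price_bin)) +
  geom_point()
print(p)
```

Note that `cut()` stops with an error when two cut points are equal, which can happen if many teas share the same price; in that case fewer buckets or hand-picked breaks are needed.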

What tends to be true about the number of reviews for the most expensive 20% of teas?

Answer: The most expensive 20% of teas tend to have fewer reviews than the other teas.

Create a dataset named white that consists of only white teas.

## # A tibble: 17 x 5
##    name                  type  score price num_reviews
##    <chr>                 <chr> <int> <int>       <int>
##  1 silver_needle         white    95    64         963
##  2 jasmine_silver_needle white    96    49         678
##  3 white_symphony        white    94    32         577
##  4 white_peach           white    95    19        1669
##  5 white_strawberry      white    94    19         488
##  6 white_peony           white    93    25        1113
##  7 white_blueberry       white    94    19        1353
##  8 white_eternal_spring  white    92    19         340
##  9 white_pear            white    91    19         814
## 10 white_darjeeling      white    94    46         107
## 11 white_fuzzy_navel     white    93    19          17
## 12 white_grapefruit      white    92    19         379
## 13 white_pearls          white    92    20          49
## 14 white_tropics         white    90    19         620
## 15 white_tangerine       white    90    19         495
## 16 snowbud               white    93    32         575
## 17 white_cucumber        white    88    19         521
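Selecting the rows of one tea type is a row filter on the `type` column. The sketch below uses base R subsetting on a made-up data frame; the names `tea_demo` and `white_demo` are hypothetical.

```r
# Hypothetical stand-in for the tea data; the values are made up.
tea_demo <- data.frame(
  name = c("tea_a", "tea_b", "tea_c", "tea_d", "tea_e"),
  type = c("white", "black", "white", "green", "black"),
  price = c(19, 12, 64, 10, 30),
  stringsAsFactors = FALSE
)

# Keep only the rows whose type is exactly "white".
# With dplyr loaded, filter(tea_demo, type == "white") selects the same rows.
white_demo <- tea_demo[tea_demo$type == "white", ]
print(white_demo)
```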

Calculate the standard deviation of the price for white teas and the standard deviation of the price for all of the teas.

## [1] 13.59444
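The exercise asks for two numbers, while only one is printed above. A sketch that computes both side by side, on made-up data, might look like this:

```r
# Hypothetical stand-in for the tea data; the values are made up.
tea_demo <- data.frame(
  type = c("white", "white", "white", "black", "black", "green", "masters"),
  price = c(19, 19, 32, 12, 30, 10, 142)
)

# Standard deviation of price within the white teas only.
sd_white <- sd(tea_demo$price[tea_demo$type == "white"])

# Standard deviation of price across every tea.
sd_all <- sd(tea_demo$price)

print(c(white = sd_white, all = sd_all))
```

Printing both values in one named vector makes the comparison in the next question direct.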

Is the variation of the white tea prices smaller, larger, or about the same as the entire dataset?

Answer: The variation of the white tea prices is smaller than that of the entire dataset.

Summarize the dataset by the type of tea and save the results as a variable named tea_type.

## # A tibble: 12 x 14
##    type   score_mean price_mean num_reviews_mean score_median price_median
##    <chr>       <dbl>      <dbl>            <dbl>        <dbl>        <dbl>
##  1 black        93.7       23.4              995         94.0         17.0
##  2 chai         93.3       12.0             1069         93.0         12.0
##  3 decaf        93.2       15.0              303         94.0         15.0
##  4 flavo…       92.0       10.0              891         92.0         10.0
##  5 green        93.0       17.9              668         93.0         12.0
##  6 herbal       93.2       11.6              916         93.0         12.0
##  7 maste…       94.6      124                115         95.0        142  
##  8 matcha       91.0       60.0              108         92.0         60.0
##  9 oolong       93.5       28.9              636         94.0         30.5
## 10 pu_erh       91.6       20.6              473         92.0         15.0
## 11 rooib…       92.3       11.7              509         92.5         10.0
## 12 white        92.7       26.9              633         93.0         19.0
## # ... with 8 more variables: num_reviews_median <dbl>, score_sd <dbl>,
## #   price_sd <dbl>, num_reviews_sd <dbl>, score_sum <int>,
## #   price_sum <int>, num_reviews_sum <int>, n <int>
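The function that produced the 14-column summary above is not shown. As a swapped-in alternative, a smaller per-type summary can be sketched with dplyr's `group_by()` and `summarise()`, assuming dplyr is available; the data and the name `tea_type_demo` are hypothetical.

```r
library(dplyr)

# Hypothetical stand-in for the tea data; the values are made up.
tea_demo <- data.frame(
  type = c("black", "black", "white", "white", "white"),
  score = c(94, 92, 95, 93, 91),
  price = c(17, 30, 19, 64, 19)
)

# One row per tea type, with the mean score, mean price, and number of teas.
tea_type_demo <- tea_demo %>%
  group_by(type) %>%
  summarise(
    score_mean = mean(score),
    price_mean = mean(price),
    n = n()
  )
print(tea_type_demo)
```

Further columns such as medians or standard deviations can be added as extra arguments to `summarise()` in the same pattern.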

Plot the average price (x-axis) against the average score (y-axis) of each type of tea. Make the size of the points proportional to the number of teas in each category and label the points with geom_text_repel and the tea type.
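A sketch of this kind of labeled plot follows, assuming the ggplot2 and ggrepel packages are available. The summary table `tea_type_demo` and all of its values are made up for illustration.

```r
library(ggplot2)
library(ggrepel)

# Hypothetical per-type summary; every value here is made up.
tea_type_demo <- data.frame(
  type = c("type_a", "type_b", "type_c", "type_d"),
  price_mean = c(20, 15, 120, 60),
  score_mean = c(93.5, 93.0, 94.5, 91.0),
  n = c(40, 30, 5, 8)
)

# Point size tracks the number of teas; labels are nudged apart so they do not overlap.
p <- ggplot(tea_type_demo, aes(x = price_mean, y = score_mean)) +
  geom_point(aes(size = n)) +
  geom_text_repel(aes(label = type))
print(p)
```

Mapping `size` inside `aes()` is what ties it to the data; setting it outside `aes()` would give every point the same fixed size.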

Describe an interesting pattern or set of outliers that you found in the previous plot. This does not need to take more than 1-2 sentences.

Answer: The outliers are the matcha and masters teas. Both have much higher average prices than the other types, but they sit at opposite extremes of average score: masters has the highest (94.6) and matcha the lowest (91.0).